AllLife Bank Project

Background:

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objectives:

  1. To predict whether a liability customer will buy a personal loan or not.
  2. To identify which variables are most significant.
  3. To determine which segment of customers should be targeted more.

Data Dictionary:

1 Loading the Libraries

2 Load and explore the data

2.1 Read the data file

2.2 Explore the data

Observations

2.3 Data Quality

2.3.1 Null Values

2.3.2 Duplicate Values
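The null and duplicate checks above can be sketched as follows. The dataframe here is a toy stand-in, since the actual Loan data file is not shown; column names are assumed for illustration.

```python
import pandas as pd

# Toy stand-in for the Loan dataset; column names are assumed for illustration
df = pd.DataFrame({
    "Age": [25, 40, 40, 33],
    "Income": [49.0, 100.0, 100.0, None],
    "Personal Loan": [0, 1, 1, 0],
})

null_counts = df.isnull().sum()       # missing values per column
n_duplicates = df.duplicated().sum()  # count of fully duplicated rows
print(null_counts)
print("Duplicate rows:", n_duplicates)
```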

2.3.3 Data Summary

Observations

3 Exploratory Data Analysis (EDA)

3.1 Univariate Data Analysis
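For each numeric column below, a univariate look might combine summary statistics with a skewness check. A minimal sketch, using illustrative values in place of the real Income column:

```python
import pandas as pd

# Illustrative values standing in for the Income column ($000s)
income = pd.Series([20, 35, 50, 80, 180], name="Income")

summary = income.describe()   # count, mean, std, quartiles, min/max
skewness = income.skew()      # > 0 indicates a right-skewed distribution
print(summary)
print("Skewness:", skewness)
```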

Age

Observations

Experience

Observations

Income

Observations

Family

Observations

CCAvg

Observations

Education

Observations

Mortgage

Observations

3.2 Bivariate Data Analysis
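A simple way to examine pairs like the ones below is to aggregate one variable by the other. A sketch with toy rows (values are illustrative only):

```python
import pandas as pd

# Toy rows standing in for the dataset; values are illustrative only
df = pd.DataFrame({
    "Education": [1, 1, 2, 2, 3, 3],
    "Income": [120, 130, 80, 90, 60, 70],
})

income_by_education = df.groupby("Education")["Income"].mean()
print(income_by_education)
```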

Income vs Age

Observations

Income vs Education

Observations

Income vs Mortgage

Observations

Income vs Family

Observations

Personal Loan vs CD Account

Observations

Personal Loan vs Securities Account

Observations

4 Outliers

4.1 Visualizing Outliers

Observations

4.2 Dealing with Outliers

Income

Credit Card Average

Mortgage
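One common treatment for all three columns is IQR-based capping rather than dropping rows. A sketch, assuming the 1.5×IQR rule and an illustrative series:

```python
import pandas as pd

# Illustrative series standing in for Income / CCAvg / Mortgage
values = pd.Series([10, 12, 13, 14, 15, 16, 90])  # 90 is an outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = values.clip(lower, upper)  # cap rather than drop the outliers
print(capped.tolist())
```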

4.3 Data after Outlier Treatment

Observations

5 Data Transformation

5.1 Dropping columns and removing negative values

5.2 Converting "Income" and "Mortgage" to monthly amounts
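The conversion itself is a division by 12, assuming Income and Mortgage are annual figures. A minimal sketch with a toy frame:

```python
import pandas as pd

# Toy frame; Income and Mortgage are assumed to be annual amounts
df = pd.DataFrame({"Income": [120.0], "Mortgage": [240.0]})

df["Income"] = df["Income"] / 12     # annual -> monthly
df["Mortgage"] = df["Mortgage"] / 12
print(df)
```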

Observations

6 Correlations
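A pairwise correlation matrix drives this section; in the Loan data, Age and Experience are typically the highly correlated pair. A sketch with a small frame constructed so Age and Experience move in lockstep:

```python
import pandas as pd

# Small illustrative frame; Age and Experience move in lockstep by construction
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Experience": [1, 6, 11, 16, 21],
    "Income": [50, 40, 80, 60, 90],
})

corr = df.corr()
print(corr)
```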

Observations

Observations

7 Identifying Liability Customers

Note: 1,549 customers have a negative balance.

Observations

8 Model

8.1 Split Data
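The split can be sketched with scikit-learn's train_test_split; the 70/30 ratio and stratification are assumptions, not necessarily the notebook's settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features/target standing in for the prepared Loan data
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Stratify on y so both splits keep the original class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(len(X_train), len(X_test))
```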

Observations

8.2 Training Data

8.2.1 Building Logistic Regression
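Fitting and scoring the classifier on the training data might look like the following; the data here is synthetic and nearly separable, standing in for the real training set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic, nearly separable data in place of the real training set
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

train_acc = model.score(X, y)
cm = confusion_matrix(y, model.predict(X))
print("Training accuracy:", train_acc)
print(cm)
```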

8.2.2 Prediction on Training Data

8.2.3 Confusion Matrix on Training Data

Observations

8.3 Testing Data

8.3.1 Prediction on Testing Data

8.3.2 Confusion Matrix on Testing Data

8.3.3 Model Score Testing Data

Observations

8.4 Accuracy Score

8.4.1 Accuracy Calculation

8.4.2 Area Under the Curve

Observations

8.5 Optimal Threshold

8.5.1 Calculating Threshold
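One common way to pick the threshold is maximizing Youden's J statistic (tpr − fpr) over the ROC curve; the labels and probabilities below are illustrative, and the J-statistic choice is an assumption about the method used:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.35, 0.40, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
optimal_idx = np.argmax(tpr - fpr)         # Youden's J statistic
optimal_threshold = thresholds[optimal_idx]
print("Optimal threshold:", optimal_threshold)
```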

8.5.2 Confusion Matrix Testing vs Threshold

8.5.3 Accuracy with Optimal Threshold

Observations

8.6 Multicollinearity

8.6.1 Checking for Multicollinearity

8.6.2 Checking for Perfect Collinearity
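The standard check is the variance inflation factor, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the rest. A self-contained numpy sketch (the `vif` helper is hypothetical, written here for illustration rather than taken from a library):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2), with R^2 from regressing
    X[:, j] on the other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# x2 is nearly a copy of x1, so its VIF should be large
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])
```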

Observations

9 Building Logistic Regression with statsmodels

9.1 Split Data

9.2 Test Data

9.2.1 Building Model

9.2.2 Summary

Observations

9.2.3 Calculate the Odds Ratio

Calculate the odds ratio from each coefficient using the formula odds ratio = exp(coef). Calculate the probability from the odds ratio using the formula probability = odds / (1 + odds).
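The two formulas above translate directly to numpy; the coefficient values here are illustrative, not the fitted ones:

```python
import numpy as np

# Illustrative logistic-regression coefficients (not the fitted values)
coef = np.array([0.05, 1.20, -0.30])

odds_ratio = np.exp(coef)                    # odds ratio = exp(coef)
probability = odds_ratio / (1 + odds_ratio)  # probability = odds / (1 + odds)
print(odds_ratio)
print(probability)
```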

Observations

9.3 Most Significant Variable

9.4 Prediction on Train Data

9.4.1 Confusion Matrix

9.5 Prediction on Test Data

9.5.1 Confusion Matrix

9.6 Accuracy

9.6.1 Accuracy Calculation

9.6.2 Area Under the Curve (AUC)

Observations

9.7 Optimal Threshold

9.7.1 Calculating Threshold

9.7.2 Confusion Matrix

9.7.2.1 Confusion Matrix Training Data
9.7.2.2 Confusion Matrix Test Data

9.7.3 Accuracy

Observations

10 Decision Tree Classifier

10.1 Split Data

Observations

10.2 Model Building
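Building the tree with defaults can be sketched as below; the synthetic data makes the memorization behavior of an unpruned tree visible (training accuracy of 1.0):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=1)
tree.fit(X, y)
train_acc = tree.score(X, y)  # an unpruned tree memorizes the training set
print("Training accuracy:", train_acc)
```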

10.3 Scoring the Decision Tree

10.3.1 Accuracy

10.3.2 Confusion Matrix

10.3.2.1 Confusion Matrix Training Data
10.3.2.2 Confusion Matrix Test Data

10.3.3 Recall Score

Observations

10.3.4 Visualizing the Decision Tree

10.4 Feature Importance

Observations

10.5 Using GridSearch for Hyperparameter Tuning of the Tree Model
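A GridSearchCV sketch over tree hyperparameters; the grid values, the recall scoring choice, and the synthetic data are all assumptions for illustration, not the notebook's actual settings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the grid values are illustrative
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=120) > 0).astype(int)

param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, scoring="recall", cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```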

Observations

10.5.1 Confusion Matrix

10.5.2 Recall

10.5.3 Visualizing the Pruned Tree

Observations

10.6 Cost Complexity Pruning

10.6.1 Impurities vs Alpha

Observations

10.6.2 ccp_alpha

A decision tree is trained for each of the effective alphas. The last value in ccp_alphas is the alpha that prunes the whole tree, leaving the tree, clfs[-1], with a single node.
For the remainder of the analysis, we remove the last element from clfs and ccp_alphas; the chart will show that shallower trees overfit less.
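The procedure described above follows scikit-learn's cost-complexity pruning API; a sketch with noisy synthetic data in place of the Loan training set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so the unpruned tree overfits
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

clfs = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]
single_node = clfs[-1].tree_.node_count  # the largest alpha leaves one node

# Drop the fully pruned tree before charting accuracy/recall vs. alpha
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
print(len(clfs), single_node)
```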

Observations

10.6.3 Accuracy vs alpha for training and testing sets

Observations

When ccp_alpha is set to zero, keeping the other DecisionTreeClassifier defaults, the tree overfits, leading to 100% training accuracy and 88% testing accuracy. As alpha increases, more of the tree is pruned, producing a decision tree that generalizes better. In this example, setting ccp_alpha=0.015 maximizes the testing accuracy.

10.6.4 Recall vs alpha for training and testing sets

Observations

10.6.5 Visualizing the Decision Tree

10.6.6 Feature Importance

Observations

11 Comparing all the decision tree models

Observations

12 Conclusion and Recommendations

Best prediction:

Most Significant variables:

Segments to be targeted: